8 research outputs found

    Spam-T5: Benchmarking Large Language Models for Few-Shot Email Spam Detection

    This paper investigates the effectiveness of large language models (LLMs) in email spam detection by comparing prominent models from three distinct families: BERT-like, Sentence Transformers, and Seq2Seq. Additionally, we examine well-established machine learning techniques for spam detection, such as Naïve Bayes and LightGBM, as baseline methods. We assess the performance of these models across four public datasets, using different numbers of training samples (full training set and few-shot settings). Our findings reveal that, in the majority of cases, LLMs surpass the performance of the popular baseline techniques, particularly in few-shot scenarios. This adaptability renders LLMs uniquely suited to spam detection tasks, where labeled samples are limited in number and models require frequent updates. Additionally, we introduce Spam-T5, a Flan-T5 model that has been specifically adapted and fine-tuned for the purpose of detecting email spam. Our results demonstrate that Spam-T5 surpasses baseline models and other LLMs in the majority of scenarios, particularly when only a limited number of training samples is available. Our code is publicly available at https://github.com/jpmorganchase/emailspamdetection

    An Unsupervised Method for Estimating Class Separability of Datasets with Application to LLMs Fine-Tuning

    This paper proposes an unsupervised method that leverages topological characteristics of data manifolds to estimate the class separability of the data without requiring labels. Experiments on several datasets demonstrate a clear correlation and consistency between the class separability estimated by the proposed method and supervised metrics such as the Fisher Discriminant Ratio (FDR) and cross-validation of a classifier, both of which require labels. This can enable learning paradigms aimed at learning from both labeled and unlabeled data, such as semi-supervised and transductive learning. This is particularly useful when labeled data is limited and a relatively large unlabeled dataset is available to enhance the learning process. The proposed method is implemented for language model fine-tuning with an automated stopping criterion that monitors the class separability of the embedding-space manifold in an unsupervised setting. The methodology was first validated on synthetic data, where the results show a clear consistency between the class separability estimated by the proposed method and the class separability computed by FDR. The method was also applied to both public and internal data. The results show that the proposed method can effectively aid, without the need for labels, a decision on when to stop or continue the fine-tuning of a language model, and on which fine-tuning iteration is expected to achieve maximum classification performance, by quantifying the class separability of the embedding manifold.
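As a point of reference for the supervised metric mentioned in this abstract, the sketch below computes a Fisher Discriminant Ratio for two classes on a single feature, using the common textbook form (difference of class means squared, divided by the sum of class variances). The abstract does not state which FDR variant the paper uses, so the formula choice, the function name `fisher_discriminant_ratio`, and the toy values are assumptions for illustration only.

```python
from statistics import fmean, pvariance


def fisher_discriminant_ratio(class_a, class_b):
    """Two-class, single-feature FDR: (mean_a - mean_b)^2 / (var_a + var_b).

    Assumed textbook form; the paper may use a multi-class or
    multi-feature variant that is not described in the abstract.
    """
    mean_a, mean_b = fmean(class_a), fmean(class_b)
    var_a, var_b = pvariance(class_a), pvariance(class_b)
    denominator = var_a + var_b
    if denominator == 0:
        # Degenerate case: no within-class spread at all.
        return float("inf") if mean_a != mean_b else 0.0
    return (mean_a - mean_b) ** 2 / denominator


# Toy, made-up feature values: labels are needed to split the two classes,
# which is exactly the requirement the unsupervised method aims to avoid.
well_separated = fisher_discriminant_ratio([0.0, 1.0, 2.0], [10.0, 11.0, 12.0])
overlapping = fisher_discriminant_ratio([0.0, 1.0, 2.0], [1.0, 2.0, 3.0])
print(well_separated > overlapping)
```

A larger ratio indicates classes whose means are far apart relative to their spread, which is the quantity the unsupervised estimate is reported to track.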

    Anomaly-Based Network Intrusion Detection with Machine Learning (original title: Détection d'intrusion réseau par anomalies avec apprentissage automatique)

    In recent years, hacking has become an industry unto itself, increasing the number and diversity of cyber attacks. Threats to computer networks range from malware to denial-of-service attacks, phishing and social engineering. An effective cyber security plan can no longer rely solely on antivirus software and firewalls to counter these threats: it must include several layers of defence. Network-based Intrusion Detection Systems (IDSs) are a complementary means of enhancing security, with the ability to monitor packets from OSI layer 2 (Data link) to layer 7 (Application). Intrusion detection techniques are traditionally divided into two categories: signature-based (or misuse) detection and anomaly detection. Most IDSs in use today rely on signature-based detection; however, they can only detect known attacks. IDSs using anomaly detection are able to detect unknown attacks, but are unfortunately less accurate, which generates a large number of false alarms. In this context, the creation of precise anomaly-based IDSs is of great value for identifying attacks that are still unknown. In this thesis, machine learning models are studied to create IDSs that can be deployed in real computer networks. Firstly, a three-step optimization method is proposed to improve the quality of detection: 1/ data augmentation to rebalance the dataset, 2/ parameter optimization to improve the model performance and 3/ ensemble learning to combine the results of the best models. Flows detected as attacks can be analyzed to generate signatures to feed signature-based IDS databases. However, this method has the disadvantage of requiring labelled datasets, which are rarely available in real-life situations. Transfer learning is therefore studied in order to train machine learning models on large labelled datasets, then fine-tune them on benign traffic of the network to be monitored. This method also has flaws, since the models learn from already known attacks and therefore do not actually perform anomaly detection. Thus, a new solution based on unsupervised learning is proposed. It uses network protocol header analysis to model normal traffic behavior. Detected anomalies are then aggregated into attacks, or ignored when isolated. Finally, the detection of network congestion is studied. The bandwidth utilization between different links is predicted in order to correct issues before they occur.

    Anomaly Detection in Vehicle-to-Infrastructure Communications

    This paper presents a neural network-based anomaly detection system for vehicular communications. The proposed system is able to detect in-vehicle data tampering in order to avoid the transmission of bogus or harmful information. We investigate the use of Long Short-Term Memory (LSTM) and Multilayer Perceptron (MLP) neural networks to build two prediction models. For each model, an efficient architecture is designed based on appropriate hardware requirements. Then, a comparative performance analysis is provided to recommend the most efficient neural network model. Finally, a set of metrics is selected to show the accuracy of the proposed detection system under several types of security attacks.

    Unsupervised protocol-based intrusion detection for real-world networks

    Anomaly-based Intrusion Detection Systems (IDSs) are rarely deployed in real networks because of their high false positive rate. Their ability to detect unknown attacks is, however, very valuable in a context where new threats are emerging almost daily. This paper presents an unsupervised anomaly-based intrusion detection solution focused on protocol header analysis. This approach is tested on a recent and realistic dataset (CICIDS2017) over a 4-day period. Each protocol is converted to a set of normalized numeric features, which are processed by 5 neural network architectures: deep autoencoders, deep MLPs, LSTMs, BiLSTMs, and GANs. The output of these algorithms is an anomaly score, which is normalized and combined with the anomaly scores of other protocols. We argue that this classification problem is very different from the actual problem of intrusion detection and requires new metrics. In particular, packet anomaly scores must be refined in a post-processing step to aggregate anomalies into continuous attacks. This approach successfully detects 7 out of 11 attacks not seen during the training phase, without any false positives. It is thus possible to consider deploying such IDSs, capable of reliably detecting zero-day attacks, in real-world networks.
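The post-processing idea in this abstract (aggregating per-packet anomaly scores into continuous attacks and discarding isolated anomalies) can be illustrated with a minimal sketch. The abstract does not describe the actual procedure, so the function `aggregate_anomalies`, its `threshold`, `max_gap` and `min_length` parameters, and the sample scores are all hypothetical choices made for illustration.

```python
def aggregate_anomalies(scores, threshold=0.5, max_gap=1, min_length=2):
    """Group flagged packets into runs and drop runs that are too short.

    scores: normalized anomaly scores, one per packet, in arrival order.
    threshold: a packet is flagged when its score is strictly above this.
    max_gap: number of unflagged packets tolerated inside one run.
    min_length: minimum span (in packets, gaps included) to keep a run.
    Returns a list of (start_index, end_index) pairs, both inclusive.
    All parameter values are illustrative assumptions, not the paper's.
    """
    runs = []
    start = last = None
    for index, score in enumerate(scores):
        if score <= threshold:
            continue
        if start is None:
            start = last = index
        elif index - last <= max_gap + 1:
            last = index  # close enough: extend the current run
        else:
            runs.append((start, last))
            start = last = index
    if start is not None:
        runs.append((start, last))
    # Isolated anomalies produce short runs and are ignored.
    return [(s, e) for s, e in runs if e - s + 1 >= min_length]


# Made-up scores: one isolated spike at index 1, one sustained burst at 5..8.
example_scores = [0.1, 0.9, 0.2, 0.1, 0.1, 0.8, 0.9, 0.2, 0.95, 0.1]
print(aggregate_anomalies(example_scores))
```

With these assumed settings the isolated spike is dropped and only the sustained burst is reported, which is the kind of behavior that lets an evaluation count attacks rather than individual packets.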

    A cascade-structured meta-specialists approach for neural network-based intrusion detection

    An ensemble learning approach for classification in intrusion detection is proposed. Its application to the KDD Cup 99 and NSL-KDD datasets consistently increases the classification accuracy compared to previous techniques. The cascade-structured meta-specialists architecture is based on a three-step optimization method: data augmentation, hyperparameter optimization and ensemble learning. Classifiers are first created with a strong specialization in each specific class. These specialists are then combined to form meta-specialists, which are more accurate than the best classifiers that compose them. Finally, meta-specialists are arranged in a cascading architecture where each classifier is successively given the opportunity to recognize its own class. This method is particularly useful for datasets where training and test sets differ greatly, as in this case. The cascade-structured meta-specialists approach achieved a very high classification accuracy (94.44% on the KDD Cup 99 test set and 88.39% on the NSL-KDD test set) with a low false positive rate (0.33% and 1.94% respectively).
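The cascading arrangement described in this abstract, where each classifier in turn gets the chance to recognize its own class, can be sketched as a simple ordered dispatch. This is only an illustration of the control flow: the paper's meta-specialists are trained neural network ensembles, whereas the recognizers below are hand-written threshold rules, and the feature names, thresholds and stage order are hypothetical.

```python
def cascade_predict(sample, stages, default_label):
    """Return the label of the first stage whose recognizer claims the sample.

    stages: ordered (label, recognizer) pairs; each recognizer is a callable
    returning True when it recognizes its own class.
    default_label: used when no stage claims the sample.
    """
    for label, recognizes in stages:
        if recognizes(sample):
            return label
    return default_label


# Toy stand-ins for trained meta-specialists (assumed features and thresholds).
stages = [
    ("dos", lambda s: s["packets_per_second"] > 1000),
    ("probe", lambda s: s["distinct_ports"] > 50),
]

flood = {"packets_per_second": 5000, "distinct_ports": 100}
scan = {"packets_per_second": 20, "distinct_ports": 200}
quiet = {"packets_per_second": 5, "distinct_ports": 2}

print(cascade_predict(flood, stages, "normal"))
print(cascade_predict(scan, stages, "normal"))
print(cascade_predict(quiet, stages, "normal"))
```

Because earlier stages take precedence, the order of the cascade matters: the first sample satisfies both rules but is assigned to the first stage that claims it.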